English as a Foreign Language (TOEFL). Using all of the information 
provided by various responses to the test's items (the four 
alternatives, omitted, and not reached), the items' interrelations 
were analyzed by three-way multidimensional scaling for samples of 
examinees systematically varying in native language and level of 
English proficiency. Four dimensions were identified: three 
corresponded to the sections of the test, and the fourth was an 
end-of-test phenomenon. The dimensions were predominantly defined by 
easy items and were most salient for low-scoring examinees. Native 
language had little influence on results. The major conclusions were 
that the TOEFL's construct validity is supported, the test's 
interpretation varies with the examinees' English proficiency, easy 
and difficult items differ in their potential for diagnosis and 
global screening, and the dimensionality of the TOEFL and of 
competence in English depends on the examinees' English proficiency. 

Abstract 



The aim of this study was to appraise the effect of native language and level of 
English proficiency on the structure of the TOEFL. The interrelations among TOEFL 
items, using all of the information provided by the various responses to the items (the four 
alternatives, omitted, and not reached), were analyzed by three-way multidimensional scaling 
for samples of examinees systematically varying in native language and level of English 
proficiency. Four dimensions were identified: three corresponded to the sections of the 
test, and the fourth was an end-of-test phenomenon. The dimensions were predominantly 
defined by easy items and were most salient for low-scoring examinees. Native language 
had little influence on the results. Major conclusions were that the TOEFL's construct 
validity is supported, the test's interpretation varies with the examinees' English proficiency, 
easy and difficult items differ in their potential for diagnosis and global screening, and the 
dimensionality of the TOEFL and of competence in English depends on the examinees' 
English proficiency. 



The Test of English as a Foreign Language (TOEFL; Educational Testing Service, 
1985) consists of three sections, Listening Comprehension, Structure and Written 
Expression, and Vocabulary and Reading Comprehension, and provides scores for each 
section as well as a total score. The test is intended to assess the ability of nonnative 
speakers to understand spoken English, to comprehend reading materials, and to recognize 
correct structural, grammatical, and lexical usage. 

Responses on the TOEFL may reflect both the influence of the examinees' native 
language and their level of English proficiency. Work thus far has not appraised the 
independent influences of native language and level of English proficiency on TOEFL 
performance, and these variables are confounded in most of this research. 

The purpose of this study was to appraise the influence of examinees' native 
language and level of English proficiency on the structure of the TOEFL. More specifically, 
the aim was to assess the interrelations among TOEFL items for groups of examinees that 
systematically varied in native language and level of English proficiency, going beyond the 
usual right versus wrong scoring to use all the information provided by the various responses 
to the items. 

Method 

Examinees and Test Form 

The data were drawn from the 53,169 examinees who took the TOEFL in the May 
1985 international administration and had complete information. The form had 146 
operational items. Twenty-one subsamples of examinees, comprising seven language groups 
(Arabic, Chinese, Greek, Japanese, Korean, Malay, and Spanish) and three levels of 
performance on the TOEFL (High: total scores of 543 and above; Medium: scores of 
483 to 540; and Low: scores of 480 and below), were randomly drawn from the 
total sample. All language groups with approximately 400 or more examinees at each of the 
three performance levels were included. (The three levels were determined by 
trichotomizing the score distribution for the total sample.) Each subsample consisted of 400 
examinees, except for 397 in the low-scoring Greek subsample. 
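
The sampling design can be illustrated with a minimal sketch. It uses only the lower bounds of the High and Medium levels reported above; the function names, the record layout (language, total score), and the fixed random seed are assumptions made for illustration and do not describe the procedure actually used in the study.

```python
import random


def performance_level(total_score):
    """Map a TOEFL total score to one of the three performance levels."""
    if total_score >= 543:   # High: 543 and above
        return "High"
    if total_score >= 483:   # Medium: 483 up to the High cut-off
        return "Medium"
    return "Low"             # Low: 480 and below


def draw_subsamples(examinees, size=400, seed=0):
    """Draw one random subsample per (language, level) cell.

    examinees: iterable of (language, total_score) pairs (hypothetical layout).
    Cells with fewer than `size` examinees are kept whole.
    """
    rng = random.Random(seed)
    cells = {}
    for language, score in examinees:
        cells.setdefault((language, performance_level(score)), []).append(score)
    return {
        key: rng.sample(scores, size) if len(scores) > size else list(scores)
        for key, scores in cells.items()
    }
```

Applied to the seven language groups, this would yield the 21 language-by-level cells; a cell with fewer than 400 examinees, such as the low-scoring Greek group, is simply kept at its full size.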

Analysis 

For each of the 21 subsamples of examinees, a 146 x 146 matrix of symmetrical tau 
coefficients (Goodman & Kruskal, 1954; Jacobson, 1976) among the items was computed. 
This coefficient, a measure of association between two nominal variables, indicates (on a 
scale from 0 to 1) the extent to which one variable is predictable from the other, and vice 
versa. In this analysis, each item is a nominal variable with six categories (the four 
alternatives, omitted, and not reached), and the tau between a pair of items is computed 
from the resulting 6 x 6 contingency table. 
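
As a rough illustration of how such a coefficient can be obtained from a 6 x 6 table, the sketch below pools the numerators and denominators of the two asymmetric Goodman-Kruskal tau coefficients. This is one common way to form a symmetric tau; whether it matches the exact formula used in the study is an assumption, and the category labels and function names are hypothetical.

```python
# Hypothetical labels for the six response categories of an item.
CATEGORIES = ["A", "B", "C", "D", "omitted", "not_reached"]


def contingency(responses_x, responses_y):
    """Cross-tabulate two items' responses into a 6 x 6 table of counts."""
    index = {c: k for k, c in enumerate(CATEGORIES)}
    table = [[0] * len(CATEGORIES) for _ in CATEGORIES]
    for x, y in zip(responses_x, responses_y):
        table[index[x]][index[y]] += 1
    return table


def symmetric_tau(table):
    """Symmetric Goodman-Kruskal tau for a table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Variation of each margin: 1 minus the sum of squared proportions.
    v_rows = 1 - sum((r / n) ** 2 for r in row_tot)
    v_cols = 1 - sum((c / n) ** 2 for c in col_tot)
    # Expected variation of one variable within categories of the other.
    # Empty rows or columns (e.g., no "not reached" responses) are skipped.
    cols_within_rows = 1 - sum(
        (cell / n) ** 2 / (row_tot[i] / n)
        for i, row in enumerate(table) for cell in row if row_tot[i] > 0)
    rows_within_cols = 1 - sum(
        (cell / n) ** 2 / (col_tot[j] / n)
        for row in table for j, cell in enumerate(row) if col_tot[j] > 0)
    denominator = v_rows + v_cols
    if denominator == 0:
        return 0.0  # both items are constant; no variation to predict
    reduction = (v_cols - cols_within_rows) + (v_rows - rows_within_cols)
    return reduction / denominator
```

The coefficient is 1 when each item's response is fully determined by the other's and 0 when the two items are statistically independent.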

A three-way, metric multidimensional scaling analysis of the 21 tau matrices was 
carried out, using SINDSCAL (Pruzansky, 1975). Three-way scaling allows for variation 
among individuals in the salience of the dimensions (the "individuals" in the present 
application are the 21 subsamples). 
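
For orientation, the weighted Euclidean model that underlies INDSCAL-type three-way scaling can be written as follows. The notation is assumed here for illustration and is not taken from the report: items share one set of coordinates, and each subsample has its own set of dimension weights.

```latex
% Notation (assumed for illustration):
%   x_{ir}  coordinate of item i on dimension r, common to all subsamples
%   w_{kr}  nonnegative weight (salience) of dimension r for subsample k
%   R       number of dimensions retained
d_{ij}^{(k)} = \sqrt{\sum_{r=1}^{R} w_{kr}\,\bigl(x_{ir} - x_{jr}\bigr)^{2}},
\qquad
b_{ij}^{(k)} \approx \sum_{r=1}^{R} w_{kr}\, x_{ir}\, x_{jr}
```

Here $d_{ij}^{(k)}$ is the model distance between items $i$ and $j$ for subsample $k$, and $b_{ij}^{(k)}$ is the corresponding scalar product to which the model is fitted. A larger $w_{kr}$ means that dimension $r$ is more salient for subsample $k$.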

The results of the scaling were subjected to two hierarchical cluster analyses (Ward, 
1963), one on the 146 items and one on the 21 subsamples, to identify regions in the 
multidimensional space where items and subsamples formed groupings. 
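
A minimal sketch of agglomerative clustering with Ward's minimum-variance criterion is given below. It assumes that each object (an item or a subsample) is represented by its coordinates or weights on the scaling dimensions; the function name and the stopping rule based on a requested number of clusters are illustrative choices, not details taken from the study.

```python
def ward_clusters(points, n_clusters):
    """Merge clusters greedily by Ward's criterion until n_clusters remain.

    points: list of equal-length coordinate tuples.
    Returns a list of clusters, each a sorted list of point indices.
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        total = len(ma) + len(mb)
        centroid = [(len(ma) * x + len(mb) * y) / total for x, y in zip(ca, cb)]
        clusters[a] = (ma + mb, centroid)
        del clusters[b]
    return [sorted(members) for members, _ in clusters]
```

At each step the pair of clusters whose merger adds the least to the total within-cluster sum of squares is combined, which tends to produce compact groupings in the multidimensional space.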

Results and Discussion 

Item Dimensions and Clusters 

Dimensions. Based on an examination of the variance accounted for and the 
interpretability of the dimensions, the four-dimension SINDSCAL solution was chosen. 

Figure 1 presents the items plotted for each pair of dimensions. Dimension I was 
defined by the items (all relatively difficult) for the last two Reading Comprehension 
passages, located at the very end of the test. This dimension appears to reflect the degree 
to which items were omitted or not reached: for the total sample, the items' coordinates on 
Dimension I correlated -.94 with the proportion of omitted responses and -.85 with the 
proportion of not reached responses. Dimension II was defined by relatively easy Listening 
Comprehension items; Dimension III by easy Vocabulary and Reading Comprehension 
items; and Dimension IV by items in one of the two Reading Comprehension passages at 
the end of the test, at one pole, and easy Structure and Written Expression items, at the 
other pole. 

Thus, the easier items in each section of the test defined three of the dimensions. 
An additional dimension was defined by difficult items associated with reading passages and 
appears to be an end-of-test phenomenon. The remaining items contributed little to the 
emergence of any of the dimensions. 



See Figure 1 



Clusters. The tree diagram for the cluster analysis of items appears in Figure 2. 
Seven clusters were interpretable. The clusters consisted of (a) Reading Comprehension 
items for the next to last passage in the test, (b) Reading Comprehension items from the 
last passage in the test, (c) easy Listening Comprehension items, (d) easy Vocabulary and 
Reading Comprehension items, (e) easy Structure and Written Expression items, (f) 
medium difficulty Structure and Written Expression items, and (g) difficult items scattered 
throughout the test, a kind of "general" cluster. 

An examination of the locations of the clusters on the plots of dimensions in Figure 
1 (the clusters are shown as ellipses, with the general cluster as a shaded ellipse) shows that 
the general cluster, unlike the others, was always located at the center of each of the plots, 
indicating that its items did not define any of the dimensions. 



See Figure 2 



Subsamples Clustered by Subject Weights 

Subject weights. Figure 3 presents the subject weights for the language/level 
subsamples plotted for each pair of dimensions. The subject weights on all the dimensions 
were greater for the low-scoring subsamples, with the largest weights occurring for the 
low-scoring Arabic, Greek, Japanese, and Spanish subsamples on Dimension I. These results 
indicate that the dimensions were more salient for the low-scoring subsamples, and the 
end-of-test dimension (Dimension I) had greater salience for some of these subsamples, the only 
instance in which language group had an effect. 




Clusters. Three subsample clusters were interpretable; they are shown (as ellipses) 
in Figure 3. They consisted of (a) low-scoring Arabic, Greek, Japanese, and Spanish; (b) 
low-scoring Chinese, Korean, and Malay, plus medium-scoring Spanish; and (c) the 
remaining subsamples (all medium- and high-scoring). 

Inspection of the locations of these clusters on the plots of subject weights in Figure 
3 reveals that the two clusters of low-scoring subsamples differed primarily in the salience of 
Dimension I. This difference occurred because the proportion of omitted and not reached 
responses was substantially greater for the cluster of low-scoring Arabic, Greek, Japanese, 
and Spanish than for the other low-scoring cluster or for the cluster of medium- and 
high-scoring examinees. 



See Figure 3 



Conclusions 

Item Difficulty 

The failure of the difficult items to contribute to defining the dimensions was 
unexpected. Because examinees make more errors on difficult items, these items might be 
expected to be more likely to cluster in ways that depend on errors. One conjecture is that 
difficult TOEFL items are not univocal because they invoke a broad knowledge base, 
several distinct kinds of processes, or higher-level organizational or strategic skills that apply 
across many situations. 

Native Language and English Proficiency 

The present findings bear on the question of how many factors are measured by the 
TOEFL (e.g., Hosley & Meredith, 1979), and whether competence in a second language is 
unitary or multidimensional and, if the latter, what is the nature and relative importance 
of the various dimensions (e.g., see the review by Vollmer & Sang, 1983). The study 
suggests that the proficiency level of the sample exerts considerable influence on the test 
structure that is observed. 

Implications for the TOEFL's Validity and Use 

The findings have implications for the TOEFL's validity and use. The parallels 
between the dimensions and the sections of the test support its construct validity. The 
similarity in the dimensions for the different language groups suggests that the test is 
measuring the same constructs in each group. And the greater salience of the dimensions 
for the low-scoring examinees implies that the test is measuring more differentiated and 
distinctive constructs for these individuals. 

This last finding also suggests that the interpretation of TOEFL section scores 
depends on the examinees' overall level of proficiency. The section scores are likely to be 
most useful for low scorers, helping to pinpoint the strengths and weaknesses of these 
individuals. In contrast, the total score is probably most useful for high-scoring examinees, 
providing global information about their proficiency. Follow-up research is essential to 
confirm the need for differential score interpretations of this kind. 

The results also imply that easy and difficult TOEFL items differ in their ability to 
measure specific language skills and general language proficiency. The easy items appear to 
be the best measures of specific language skills and hence may be most useful for diagnostic 
purposes; the difficult items seem to be the best measures of general proficiency and thus 
may be most useful for global screening. This outcome raises some interesting possibilities. 
One possibility would be to obtain additional scores: scores based on easy items for 
diagnosis and scores based on difficult items for global screening. 
Another possibility would be to alter what the TOEFL measures simply by changing the 
difficulty of the items in the test, either enhancing its diagnostic use by employing easy items 
or strengthening its use as a global measure by employing difficult items. Further work to 
understand and exploit the distinction between easy and difficult TOEFL items is clearly in 
order. 

Figure 2 

Hierarchical Clustering of Items 

Note. Cluster composition: 

1. Listening Comprehension, easy items 

2. General Cluster, difficult items 

3. Structure and Written Expression, easy items 

4. Structure and Written Expression, medium difficulty items 

5. Vocabulary and Reading Comprehension, easy items 

6. Reading Comprehension, next to last passage, difficult items 

7. Reading Comprehension, last passage, difficult items 

Figure 3 

Subsample Clusters Plotted on SINDSCAL Dimensions 